If you think that spatial data is just a set of longitude and latitude observations for a given area of geographic space, and that it should be treated as such regardless, you’re missing out on opportunities and have a high chances of incurring in meaningless analysis. Spatial data is not, in fact, just a set of records of observable events and/or facts in geographical space. They are the (potential) starting point for answering relevant questions about various problems in a given reality.
In the New Era of Big Data, where decision-making is increasingly data-driven, Spatial Data Science has emerged as an essential field, with applications spanning many sectors of society (e.g. agriculture, environmental science, urban planning, clean energy generation, sanitation, water management, and much more.
For these and many other reasons, I challenge you to reflect on your convictions about Spatial Data, this important entity in the context of Big Data. A lot of things has changed, is changing and will change faster and faster, especially due to the advance of visual and computational technologies. No matter what your area of interest in this field, these changes presuppose, and to some extent demand, that you develop new skills.
At some point in your life you’ve heard someone say that “R is a powerful tool”. This is true, and you don’t need to make a big effort to find n reasons why it is. WWhen I wrote this notebook (using R Markdown), I did not it thinking about explaining what R is and how it works. The purpose of this notebook goes beyond this part, which I assume you already know. After all, my goal with this notebook is to demonstrate, with one possibility, how you can use R to start your studies in Spatial Data Science.
In this section, I start using spatial data about the occurrence (point geometries) of land degraded by erosion processes in natural and antropogenic ecosystems in a watershed located in the Brazilian Semiarid Region. Specifically, the data was collected and organized by me, after successive efforts in various field works in the state of Bahia, between the year 2022 and 2024.
“Land degradation is understood as the reduction of soil capacity to generate assets and services, in qualitative and quantitative terms, due to the decline of its productive potential and of its environmental regulation capacity” (LAL, 2012). From this perspective, land degradation is a complex process, associated with natural factors and human activities. According to Silva (2023), the land degradation is closely related to landscape degradation, considering the soil-landscape relationship.
On figure below I present a record (21-06-2023/17:06pm), organized by me, of an area highly degraded by an erosive process (Ravina), with a strong impact in landscape.
To reproduce this document, you need to have R installed on your machine, as well RStudio. RStudio is an Integrated Development Environment (IDE) for R.
In the chunk below you will set your Working Directory. To do this
you use setwd( ) and pass the Pass the address.
setwd("D:/Documents/Music/PROJECTSR/SpatialDataScienceinR")
In the chunk below you will install the necessary packages. As I
already have these packages installed, note that I left the command as a
comment, adding the #. If you don’t have these packages
installed, just remove the # and run. To install a
package you use install.packages( ).
The first package to install is leafleat, a
visualization tool that is highly recommended for those who want
Interactive Maps.
The second package to install is sf (Simple Features)
package, a very useful feature for those who work with maps, very
interesting for representing and working with spatial vector data
including points, polygons, and lines.
#install.packages("leafleat")
#install.packages("sf")
Once installed, in the chunk below you will load the installed
packages. All you have to do is use library( ).
library(leaflet)
library(sf)
Here, in the chunk below, you will read the spatial data, in this case, a point geometries (shp.). Note that it will be stored in a variable named shape.
shape <- st_read('D:/Documents/Music/PROJECTSR/SpatialDataScienceinR/SpatialData/LandDegradation.shp')
## Reading layer `LandDegradation' from data source
## `D:\Documents\Music\PROJECTSR\SpatialDataScienceinR\SpatialData\LandDegradation.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 2146 features and 12 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -40.69852 ymin: -11.37433 xmax: -40.67941 ymax: -11.34879
## Geodetic CRS: WGS 84
In the chunk below you can create your first interactive map using
the leaflet package. In this situation, your interactive
map show the information of land degradation occurrence. Note that, the
infomation is presented by cluster. If you zoom in, you’ll see that the
clusters are gradually dissolving. This is a very interesting approach
when you have a large number of points that can visually pollute the
map.
leaflet(shape) %>%
setView(lat = -11.351051, lng = -40.684630, zoom = 16) %>%
addProviderTiles(providers$Esri.WorldImagery, group = "Esri.WorldImagery") %>%
addProviderTiles(providers$OpenStreetMap, group = "Open Street Map") %>%
addLayersControl(baseGroups = c("ESRI World Imagery", "Open Street Map"),
options = layersControlOptions(collapsed = TRUE)) %>%
addMarkers(clusterOptions = markerClusterOptions(),
label = shape$Tipo,
labelOptions = labelOptions(
textsize = "14px",
style = list(
"font-weight" = "bold",
padding = "5px",
lat = ~Lat, lng = ~Long)))
In the chunk below you can create your second interactive map using
the leaflet package. Note that this time the information is
categorized by the Tipo field (Type of Erosion), and is
also associated with colors (Yellow, Orange and Red) that reflect the
severity of the process.
Tipo <- c("Boçoroca", "Ravina", "Sulcos")
my_palette <- c("red", "orange", "yellow")
#previewColors(colorFactor(my_palette, levels = Tipo), Tipo)
factpal <- colorFactor(my_palette, levels = Tipo)
groups = unique(shape$Tipo)
map = leaflet(shape) %>%
setView(lat = -11.351051, lng = -40.684630, zoom = 17.3) %>%
addProviderTiles(providers$Esri.WorldImagery, group = "Esri.WorldImagery")
for (i in groups) {
data = shape[shape$Tipo == i, ]
map = map %>% addCircleMarkers(data = data, radius = 4,
weight = 2, opacity = 1, fill = TRUE, fillOpacity = 0.2, color = ~factpal(Tipo), group = i)}
map %>% addLayersControl(overlayGroups = groups, options = layersControlOptions(collapsed = FALSE))
A few background questions:
In this section, I will use spatial data (Polygon geometries) about the population density in a neighborhood of Formosa City, Goiás, Brazil. Specifically, the raw data describing the number of inhabitants at the census sector scale was collected and made available digitally by the Brazilian Institute of Geography and Statistics (IBGE). It was therefore up to me to acquire, organize and process them (I did it at an previous stage using resources of the GIS/QGIs environment) in order to arrive at the population density.
Population density is very useful information in a variety of situations involving decision-making in cities, but not only E. g., population density information may be relevant for identifying, mapping and monitoring sectors where there is a high risk of COVID-19 infection.
On figure below I present a record (13-02-2024/14:19pm), organized by me, of an street located in the Formosinha neighborhood, in the city of Formosa, Goiás, Brazil. Looking at the context (presence of buildings, arrangement of the houses…), you have indications that this is a sector where there is a considerable population density.
library(leaflet)
library(sf)
library(RColorBrewer)
library(classInt)
Here, in the chunk below, you will read the spatial data, in this case, a polygon geometries (shp.). Note that it will be stored in a variable named Population.
Population <- st_read('D:/Documents/Music/PROJECTSR/SpatialDataScienceinR/SpatialData/PopulationDensity.shp')
## Reading layer `PopulationDensity' from data source
## `D:\Documents\Music\PROJECTSR\SpatialDataScienceinR\SpatialData\PopulationDensity.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 21 features and 4 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -47.33796 ymin: -15.56236 xmax: -47.32025 ymax: -15.54231
## Geodetic CRS: WGS 84
If your data is in another SRC, on chunk below you can transform to leaflet projection. In this particular case, this does not apply because the data being read is already in the correct SRC.
#Population <- st_transform(Population, crs = '+proj=longlat +datum=WGS84')
Once this is understood, with the chunk below command you can plot the interactive map.
leaflet() %>%
addProviderTiles(providers$Esri.WorldImagery, group = "Esri.WorldImagery") %>%
addProviderTiles(providers$OpenStreetMap, group = "Open Street Map") %>%
addLayersControl(baseGroups = c("ESRI World Imagery", "Open Street Map"),
options = layersControlOptions(collapsed = TRUE)) %>%
setView(lng = -47.33, lat = -15.55, zoom = 14) %>%
addMiniMap(width = 150, height = 150) %>%
addPolygons(data = Population, color = "black", stroke = 1, opacity = 0.8)
With the chunk command below, set the parameters to create a map categorized by the population density of the neighborhood.
pal_fun <- colorQuantile("YlOrRd", NULL, n = 5)
p_density <- paste0("<strong>Densid_Ha: </strong>", Population$Densid_Ha)
breaks_qt <- classIntervals(c(min(Population$Densid_Ha) - .00001, Population$Densid_Ha), n = 5, style = "jenks")
Considering the defined parameters, here you can plot and see the interactive map:
leaflet(Population) %>%
addPolygons(
stroke = TRUE,
color = "black",
fillColor = ~pal_fun(Densid_Ha),
fillOpacity = 0.8, smoothFactor = 0.5,
popup = p_density,
group = "Population") %>%
addTiles(group = "OSM") %>%
addProviderTiles("Esri.WorldImagery", group = "ESRI World Imagery") %>%
setView(lng = -47.33, lat = -15.55, zoom = 14) %>%
addLegend("bottomright",
colors = brewer.pal(5, "YlOrRd"),
labels = paste0("Up to", format(breaks_qt$brks[-1], digits = 2)),
title = 'Population Density (h/ha)') %>% addLayersControl(baseGroups = c("OSM", "ESRI World Imagery"),
overlayGroups = c("Population"))
As obvious as it may seem, when working with spatial data, ALWAYS REMEMBER: data can’t explain itself, because it can’t talk; the map doesn’t show/indicate/suggest, because the map doesn’t have a life of its own. YOU are expected to do this, and what’s more, YOU are expected to see beyond the data. The limited set of data we are working with is not always enough to answer our questions, it is up to you to know the limits, the power of generalization of the information.
Thank you!
A Correlogram is a correlation matrix plot where you can visualize the correlation coefficients between n variables. Correlograms are useful in Exploratory Data Analysis (EDA) and in understanding relationships between variables. They can help identify potential patterns or dependencies in the data, which is crucial for making informed decisions in fields like, Geography, Economics, Sociology, Biology, and many others where understanding relationships between variables is important.
In this section, I will use spatial data to calculate and visualize a correlation matrix. The data that will be used was obtained using different methods, and falls within the spatial context of the largest fragment of natural vegetation cover in the city of Formosa, Goiás, Brazil: the Mata da Bica Municipal Park (MBMP).
In this notebook I will use four libraries, leaflet, readxl and corrplot. The first will generate an interactive map of the spatial context of Mata da Bica Municipal Park, in the city of Formosa, Goiás, Brazil. The second allows you to read the data in the format that is stored in an .xlsx spreadsheet; and the third allows you to generate the correlogram. If you don’t have the libraries installed, to reproduce this notebook you need to install them.
Once installed, in the chunk below you will load the installed packages.
library(leaflet)
library(readxl)
library(corrplot)
## corrplot 0.92 loaded
Using the command below, you can plot the interactive map to visualize Mata da Bica Municipal Park, the largest forest fragment of remaining natural vegetation cover still existing in the city of Formosa, Goiás, Brazil.
MBMP_cord <- c(-47.336470147649806, -15.553588713481723)
MBMP <- leaflet() %>%
setView(lng = MBMP_cord[1], lat = MBMP_cord[2], zoom = 16.3) %>%
addProviderTiles(providers$Esri.WorldImagery, group = "ESRI World Imagery") %>%
addProviderTiles(providers$OpenStreetMap, group = "Open Street Map") %>%
addLayersControl(
baseGroups = c("ESRI World Imagery","Open Street Map"),
options = layersControlOptions(collapsed = FALSE)
)
MBMP
If you don’t know Mata da Bica Municipal Park, I realy recommend that you get to know this important Urban Green Space for Recreation. In the picture below I present a record (13-02-2024/10:49am), organized by me, of a characteristic landscape on Mata da Bica Municipal Park. In my opinion, an encouraging landscape for a walk around the lake.
As I said, the data that will be used was obtained using different methods:
Enhanced Vegetation Index (EVI): obtained from the digital processing of Sentinel-2A images.
Green Normalized Difference Vegetation Index (GNDVI): obtained from the digital processing of Sentinel-2A images.
Moisture Stress Index (MSI): obtained from the digital processing of Sentinel-2A images.
Number of Days with Precipitation (NDP): obtained from records measured on site by the National Meteorological Institute, Conventional Station Nº. 83379.
Normalized Difference Water Index (NDWI): obtained from the digital processing of Sentinel-2A images.
Monthly Precipitation (P): obtained from records measured on site by the National Meteorological Institute, Conventional Station Nº. 83379.
Total Insolation (TI): obtained from records measured on site by the National Meteorological Institute, Conventional Station Nº. 83379.
Mean Relative Air Humidity (RAHm): obtained from records measured on site by the National Meteorological Institute, Conventional Station Nº. 83379.
In the chunk below you will set your Working Directory.
setwd("D:/Documents/Music/PROJECTSR/SpatialDataScienceinR")
In the chunk below you will read the data.
Data <- read_excel("D:/Documents/Music/PROJECTSR/SpatialDataScienceinR/XLSX/DF_1.xlsx",
sheet = "df", col_types = c("numeric", "numeric", "numeric", "numeric",
"numeric", "numeric", "numeric", "numeric"))
In the chunk below, you will calculate the correlation matrix for the n (in this case 8) variables already highlighted (EVI, GNDVI, MSI, NDP, NDWI, P, RAHm, TI).
cor_matrix = cor(Data)
In the chunk below, you will plot the correlogram.
corrplot(cor_matrix,
type = "lower",
method = "square",
tl.col = "black",
bg = "White",
addCoef.col = "black",
tl.srt = 45)
It is so important to mention that, this is not the only way to
achieve this result, of course. You can modify the aesthetics of the
correlogram in different ways. E. g.: if you just change
'method=square' to 'method=circle', you will
get a graph with a new profile.
Two background question for you:
What relevant information can you extract from this graph?
Assuming that the data is indeed reliable, how do you explain such a high (0.99) negative correlation between the MSI and NDWI variables?
Thank you.
The acronym “SIDRA” refers to the Institute of Geography and Statistics (IBGE) Automatic Retrieval System, which is a platform developed to access and query statistical data produced by the institute itself. SIDRA provides access to a wide range of statistical information on various topics such as population, economy, agriculture, industry, trade, and much more.
SIDRA is widely used by researchers, academics, professionals, and the general public interested in statistical data about Brazil. It is a fundamental tool for academic studies, economic analysis, public policy planning, and business decision-making. In summary, SIDRA is an essential platform for accessing and exploring comprehensive and updated statistical data provided by the IBGE.
In this section, I will use data provided by SIDRA. The purpose of this notebook is to demonstrate, as one possibility, how to import and read a spreadsheet (.csv) downloaded from SIDRA, perform data preprocessing and processing, and finally plot a bar chart.
The data to be used were downloaded directly from SIDRA, within the scope of the Municipal Livestock Survey 2022 (PPM), which aims to provide statistical information on livestock herds, shorn sheep, milked cows, animal products, and aquaculture production. In this particular case, the survey is annual and covers the entire national territory, with information for Brazil, Geographic Regions, Federative Units, Geographic Mesoregions, Geographic Microregions, and Municipalities, currently covering the historical series from 1974 to 2022.
Considering the PPM-Tables, specifically, the data I use provide information on animal products by type. The animal product type I am using is the quantity of Chicken Eggs (Thousand dozens), with the territorial units being Brazil and the state of Goiás (1974-2022).
Eggs are one of the most consumed foods in the world. In many countries, eggs are an essential component of families’ diets due to their high nutritional value. In recent years, it has been observed that egg consumption has been gradually increasing in Brazil. Considering the data provided by IBGE, let’s observe the dynamics of egg production in Brazil and, particularly, in the state of Goiás, an important player in animal product production.
In the chunk below you will set your Working Directory.
setwd("D:/Documents/Music/PROJECTSR/SpatialDataScienceinR")
In this notebook I will use just one library: ggplot2. If you don’t have this library installed, to reproduce this notebook you need to install.
Once installed, in the chunk below you will load it.
library(ggplot2)
In the chunk below you will read the data (ignoring the first three lines, reading the next three lines).
SIDRA_Data <- read.csv2("D:/Documents/Music/PROJECTSR/SpatialDataScienceinR/CSV/avicultura_br_and_go_mil_duzia.csv", skip = 3, nrows = 3)
In the chunk below, you will visualize SIDRA_Data.
View(SIDRA_Data)
I need to remove the first row (Chicken eggs;Chicken eggs;Chicken eggs;Eggs….) and the first column (Brazil.and.Unit.of.Federation).
SIDRA_Data <- SIDRA_Data[-1, -1]
In the chunk below, you can visualize SIDRA_Data after removing the fields highlighted above.
View(SIDRA_Data)
Now you need to remove the string “X” from the column headings (X1974, X1975, X1976…). You can do it:
colnames(SIDRA_Data) <- gsub("X", "", colnames(SIDRA_Data))
In the chunk below, you need to create a vector and store the year’s information as an integer. You can do it:
Years <- as.integer(colnames(SIDRA_Data))
In the chunk below, you need to reproduce the routine presented above for the egg production values for the case of Brazil. However, unlike the previous case, you need to save the values as numerical. You can do it:
Eggs_Production_Brasil <- as.numeric(as.character(unlist(SIDRA_Data[1,], use.names = FALSE)))
In the chunk below, you need to reproduce the routine presented above for the egg production values for the state of Goiás, in the same way as for Brazil, saving the values as numeric. You can do it:
Eggs_Production_Goias <- as.numeric(as.character(unlist(SIDRA_Data[2,], use.names = FALSE)))
Now you can access this data and create a new DataFrame(DF). You can do it:
SIDRA_DF <- data.frame(Eggs_Production_Brasil, Eggs_Production_Goias, Years)
In the chunk below, you can visualize the new DF.
View(SIDRA_DF)
In the chunk below, you can plot a barplot of Brazil’s egg production.
ggplot(SIDRA_DF, aes(Years, Eggs_Production_Brasil)) +
geom_bar(stat = "identity", position = 'dodge') +
labs(x= "Years", y= "Eggs Production (Mil duzias)") +
theme_minimal()
Note that the way the values on the Y axis are presented, which refer to egg production in Brazil (a thousand dozen), makes it difficult to understand the data. The graph needs to be organized in such a way as to make it easy to read, so that the information can be extracted without any further effort being required. So, I’m going to convert the values from thousands to millions (of dozen eggs) by dividing the value of each record by 1,000,000.
In the chunk below, I will define the parameters for customizing the barplot of Brazil’s egg production. I need a plot that looks a bit better.
windowsFonts(TimesNewRoman = windowsFont("Times New Roman"))
tema <-theme(text=element_text(family="TimesNewRoman"),
title=element_text(color="black",size=12),
axis.text = element_text(color="black",size=13),
axis.title = element_text(color="black",size=13),
panel.grid=element_line(color="gray",linetype = "dotted",size=0.5),
axis.line=element_blank(),
plot.background=element_rect(fill="white",color="white"),
panel.background=element_rect(fill="white"),
panel.border = element_rect(colour = "black", fill = NA,size=0.59),
legend.key= element_rect(color="white",fill="white"))
In the chunk below, you can plot the historical series of egg production in Brazil (millions of dozen) also taking into account the customization parameters already defined.
ggplot(SIDRA_DF, aes(Years, Eggs_Production_Brasil/1000000)) +
geom_bar(stat = "identity", position = 'dodge') +
labs(title="Egg production (millions of dozen) in Brazil: 1974-2022", x= "Years", y= "Eggs Production (millions of dozen)") +
tema
In fact, we can see from the data provided by the IBGE that egg production in Brazil is following a trend of exponential growth, since 2010.
An background question for you:
Who eats so many eggs?
In the chunk below, you can plot the historical series of egg production in Goias (thousands of dozen) also taking into account the customization parameters already defined.
ggplot(SIDRA_DF, aes(Years, Eggs_Production_Goias/1000)) +
geom_bar(stat = "identity", position = 'dodge') +
labs(title="Egg production (thousands of dozen) in Goias state: 1974-2022", x= "Years", y= "Eggs Production (thousands of dozen)") +
tema
As has been observed throughout Brazil, we can also see a trend towards exponential growth in the state of Goiás.
Background questions for you:
Why are so many eggs produced in Goiás? why is production increasing? and who consumes so many eggs?
Thank you!
If you like it and want to contribute:
pix: izaiasdesouzasilvaa@gmail.com↩︎